Neural Style Transfer using DenseNet
Posted on Sun 22 July 2018 in ml
This IPython notebook implements a popular paper (Gatys et al., 2015) that demonstrates how to use neural networks to transfer artistic style from one image onto another.
The implementation is slightly modified: it uses DenseNet instead of VGG-Net, and it adds a regularization term to the overall loss function.
The main objective of the algorithm is to merge two images, a content image (C) and a style image (S), into a generated image (G) that combines the content of C with the style of S.
For example, we can see the stylized image from the Taj Mahal (content image C), mixed with a painting by Van Gogh (style image S).

The algorithm uses neural representations obtained from Convolutional Neural Networks (CNNs), which are among the most powerful models for image processing tasks. A CNN consists of a number of convolutional and subsampling layers, optionally followed by fully connected layers. Each layer can be understood as a collection of image filters, each of which extracts a certain feature from the input image; the filter outputs are called feature maps.
When a CNN is trained for object recognition, the feature maps along the processing hierarchy increasingly represent the actual content of the image rather than its detailed pixel values. This can be visualized by reconstructing the image only from the feature maps of a layer, which is called content reconstruction.
By including the feature correlations of multiple layers, we obtain a style representation of the input image, which captures its texture information but not the global arrangement. This representation captures the general appearance in terms of colour and localised structures. It can be visualized by reconstructing an image from these feature correlations, which is called style reconstruction.
Key notes from the paper
- The representations of content and style in the CNN are separable.
- Both representations can be independently manipulated to produce new, perceptually meaningful images.
- The implementation is based on a pre-trained 19-layer VGG network, using its 16 convolutional layers and leaving out the 3 fully connected layers.
- It uses average pooling instead of max pooling at the pooling layers, as this improves gradient flow and gives better results.
- To obtain a representation of the style of an input image, we employ correlations between the different filter responses over the spatial extent of the feature maps.
Style transfer is posed as an optimization problem: $$g^* = \operatorname*{argmin}_g (\alpha L_{content}(c,g) + \beta L_{style}(s,g))$$
- c = content image
- s = style image
- g = combination image
- $L_{content}$ = content loss function
- $L_{style}$ = style loss function
- $\alpha$ = content weight
- $\beta$ = style weight
This means that we are looking for an image g that differs as little as possible in content from image c, while simultaneously differing as little as possible in style from image s.
- The hyper-parameters $\alpha$ and $\beta$ control how much content and style appear in the final generated image.
- A standard VGG network takes an image and returns category scores, but this paper instead takes the outputs of intermediate layers and constructs $L_{content}$ and $L_{style}$ from them.
- Hence, the main objective of the algorithm is to minimise the above loss functions and produce a final image with the style imposed on it.
- It uses L-BFGS, a quasi-Newton method, to solve the optimization problem.
Algorithm details & loss functions
- It starts with using a pre-trained VGG-Network with 16 layers excluding fully connected layers.
- Content image, style image and generated image are passed to the network.
Content loss at layer l is calculated as follows with respect to content image (c) and generated image (g): $$L_{content}(c,g,l) = \frac12 \sum_{i,j} (F^l_{ij} - P^l_{ij})^2$$
- $P^l$ and $F^l$ are the feature representations at layer l of the content image (c) and the generated image (g) respectively.
- The squared-error loss above between the two feature representations is minimized using standard error back-propagation.
- Initially g is a white noise image (a randomly initialised image), and it is changed during back-propagation until it generates the same response in a certain layer l of the CNN as the original image c.
- Hence we must choose a layer that captures high-level content in terms of objects and their arrangement in the input image but does not constrain the exact pixel values of the reconstruction. Such a layer can be found by performing content reconstruction with the above loss function.
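To make the content loss concrete, here is a minimal NumPy sketch of the formula above. The array names and shapes are made up for illustration; in the notebook the feature maps come from a DenseNet layer, and the Keras backend version appears further down.

```python
import numpy as np

def content_loss_np(P, F):
    """Half the sum of squared differences between two feature maps."""
    return 0.5 * np.sum((F - P) ** 2)

# Hypothetical feature maps: 4 filters, each flattened to 6 positions.
rng = np.random.default_rng(0)
P = rng.normal(size=(4, 6))   # features of the content image
F = rng.normal(size=(4, 6))   # features of the generated image

print(content_loss_np(P, F))  # positive for differing features
print(content_loss_np(P, P))  # identical features give zero loss
```

The loss only compares feature maps, never pixels, which is why a deep enough layer leaves the exact pixel values unconstrained.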
Style loss is slightly tricky and is calculated as follows with respect to style image (s) and generated image (g):
- Feature correlations are calculated by the Gram matrix $G^l$, where $G^l_{ij}$ is the inner product between the vectorised feature map i and j in layer l: $$G^l_{ij} = \sum_{k} F^l_{ik} F^l_{jk}$$
- Initially g is a white noise image. We then use gradient descent to update g so that it matches the style of image s. This is done by minimising the mean-squared distance between the entries of the Gram matrix of the style image (s) and the Gram matrix of the image to be generated (g).
- The loss at a specific layer l is calculated as follows:
$$E_l = \frac1{4N^2_lM^2_l} \sum_{i,j} (G^l_{ij} - A^l_{ij})^2$$
- $N_l$ = number of feature maps (channels) in layer l
- $M_l$ = size of each feature map in layer l (height * width)
- $G^l_{ij}$ = style representation (Gram matrix) of the generated image (g), computed from $F^l$ as defined above
- $A^l_{ij}$ = style representation (Gram matrix) of the style image (s)
- The loss $E_l$ comes from a single layer l only. Combining textures (styles) from multiple layers gives better results, so the overall style loss is:
$$L_{style}(s,g) = \sum_{l=0}^{L} w_lE_l$$
- $w_l$ are the weighting factors for the contribution of each layer to the overall loss. In the paper they are set to $\frac{1}{\text{number of layers used in the style loss}}$.
- Layers are selected by performing style reconstruction with the above loss function.
- Using content reconstruction and style reconstruction, we can select layers visually. Using the feature maps at the appropriate layers, we minimise the overall loss function to generate the final image.
- Hence image generation is posed as an optimisation problem here.
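The style loss can be sketched in NumPy as follows. This is only an illustration under stated assumptions: each layer's feature maps are assumed to be already flattened to shape (N_l, M_l), the function names are hypothetical, and the layers are weighted equally.

```python
import numpy as np

def gram_np(F):
    """Gram matrix of features F with shape (N filters, M positions)."""
    return F @ F.T

def layer_style_loss_np(F_generated, F_style):
    """E_l: scaled squared difference between the two Gram matrices."""
    N, M = F_generated.shape
    gram_generated = gram_np(F_generated)
    gram_style = gram_np(F_style)
    return np.sum((gram_generated - gram_style) ** 2) / (4.0 * N**2 * M**2)

def style_loss_np(generated_layers, style_layers):
    """Equal-weight sum of E_l over the chosen layers (w_l = 1 / count)."""
    w = 1.0 / len(generated_layers)
    return sum(w * layer_style_loss_np(Fg, Fs)
               for Fg, Fs in zip(generated_layers, style_layers))

rng = np.random.default_rng(0)
# Two hypothetical layers with different filter counts and spatial sizes.
style_layers = [rng.normal(size=(3, 16)), rng.normal(size=(5, 4))]
generated_layers = [rng.normal(size=(3, 16)), rng.normal(size=(5, 4))]

print(style_loss_np(generated_layers, style_layers))
print(style_loss_np(style_layers, style_layers))  # 0.0 for matching features
```

Because each Gram matrix has shape (N_l, N_l) regardless of the spatial size, layers with different resolutions can be summed into one loss.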
Modifications to the approach
- The implementation in this notebook is inspired by a VGG16 style transfer implementation, whose author uses a VGG network with 16 layers as proposed in the paper.
- In this notebook, we will be using DenseNet (a densely connected CNN), which reported state-of-the-art results on several popular image classification benchmarks when it was introduced.
- We use Keras to implement the algorithm. Keras provides the following DenseNet models:
- DenseNet121
- DenseNet169
- DenseNet201
- We will be using DenseNet121 to save computation time and get results quickly. Larger models may give better results; we leave that experiment for later.
- To the overall loss function, we also add a total variation loss. It is a regularisation term that encourages spatial smoothness and so helps reduce the noise in the final image.
Implementation of the artistic style transfer algorithm
- We start by importing the required modules.
- As can be seen below, we import DenseNet121, which comes with the keras package.
- We also import a few additional modules, such as requests, io and IPython, to download images from the internet and to display them.
from __future__ import print_function
import time
from PIL import Image
import numpy as np
from keras import backend
from keras.models import Model
from keras.applications.densenet import DenseNet121, preprocess_input
from scipy.optimize import fmin_l_bfgs_b
import requests
from io import BytesIO
# cStringIO exists only on Python 2; on Python 3 fall back to BytesIO,
# which is what binary PNG data needs anyway
try:
    from cStringIO import StringIO
except ImportError:
    StringIO = BytesIO
import IPython.display
Load and preprocess the content and style images
- Our first task is to load the content and style images.
- Note that due to memory constraints, we use a 244 x 244 image while choosing appropriate layers for the content and style representations.
- Once the layers and hyper-parameters are finalised, we can pass a larger image to get a better view of the output.
height = 244
width = 244
img_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Taj_Mahal_in_March_2004.jpg"
response = requests.get(img_url)
content_image = Image.open(BytesIO(response.content))
content_image = content_image.resize((width, height))
content_image
img_url = "http://1.bp.blogspot.com/-yqh8zFJh_OU/UcDJKe-EYaI/AAAAAAAAS2I/Qb3bm2zwQu0/s1600/%5BIMG+SRC=%22URL%22+ALT=picasso+painting,+romancing+picasso+painting,+colorful+oil+painting,k+madison+moore+artist%5D++copy.jpg"
response = requests.get(img_url)
style_image = Image.open(BytesIO(response.content))
style_image = style_image.resize((width, height))
style_image
Then, we convert these images into a form suitable for numerical processing. In particular, we add another dimension (beyond the classic height x width x 3 dimensions) so that we can later concatenate the representations of these two images into a common data structure.
content_array = np.asarray(content_image, dtype='float32')
content_array = np.expand_dims(content_array, axis=0)
print(content_array.shape)
style_array = np.asarray(style_image, dtype='float32')
style_array = np.expand_dims(style_array, axis=0)
print(style_array.shape)
Now we're ready to use these arrays to define variables in Keras' backend (the TensorFlow graph). We also introduce a placeholder variable to store the combination image (generated image) that retains the content of the content image while incorporating the style of the style image.
content_image = backend.variable(content_array)
style_image = backend.variable(style_array)
combination_image = backend.placeholder((1, height, width, 3))
Finally, we concatenate all this image data into a single tensor that's suitable for processing by Keras' DenseNet121 model.
input_tensor = backend.concatenate([content_image,
                                    style_image,
                                    combination_image], axis=0)
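The effect of the extra batch dimension and the concatenation can be checked with plain NumPy. The tiny image size and the constant pixel values below are made up purely to show the shapes; the ordering (content, style, combination) matches the concatenation above.

```python
import numpy as np

height, width = 4, 5  # tiny hypothetical size, just to show the shapes

content = np.zeros((height, width, 3), dtype='float32')
style = np.ones((height, width, 3), dtype='float32')
combination = np.full((height, width, 3), 2.0, dtype='float32')

# Add the leading batch axis to each image, then stack along that axis.
batch = np.concatenate([np.expand_dims(img, axis=0)
                        for img in (content, style, combination)], axis=0)

print(batch.shape)  # (3, 4, 5, 3)

# Index 0 is the content image, 1 the style image, 2 the combination image.
content_slice, style_slice, combination_slice = batch[0], batch[1], batch[2]
```

The same indices 0, 1 and 2 are used later to slice each layer's output into content, style and combination features.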
Reuse a model pre-trained for image classification to define loss functions
The core idea introduced by Gatys et al. (2015) is that convolutional neural networks (CNNs) pre-trained for image classification already know how to encode perceptual and semantic information about images. We're going to follow their idea, and use the feature spaces provided by one such model to independently work with content and style of images.
The original paper uses the 19-layer VGG network model from Simonyan and Zisserman (2015), but we're going to use the DenseNet121 model instead.
Also, since we're not interested in the classification problem, we don't need the fully connected layers or the final softmax classifier. We only need the part of the model before the "Classification Layer" in the architectures below:

As seen above, for DenseNet-121 there are 121 layers (1 right after Input Layer, 116 from Dense Block, 3 from Transition Layer & 1 FC Layer). Note: Pooling is not considered as a layer.
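The layer count quoted above can be reproduced with a few lines of arithmetic. The block sizes (6, 12, 24, 16) are the standard DenseNet-121 configuration; the variable names are only for this sketch.

```python
# Number of dense layers in each of the four dense blocks of DenseNet-121.
blocks = (6, 12, 24, 16)

initial_conv = 1                    # the convolution right after the input
dense_convs = 2 * sum(blocks)       # each dense layer has a 1x1 and a 3x3 conv
transition_convs = len(blocks) - 1  # one 1x1 conv between consecutive blocks
classifier = 1                      # the final fully connected layer

total = initial_conv + dense_convs + transition_convs + classifier
print(dense_convs, transition_convs, total)  # 116 3 121
```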
It is trivial for us to get access to this truncated model because Keras comes with a set of pretrained models, including the DenseNet121 model we're interested in. Note that by setting include_top=False in the code below, we don't include any of the fully connected layers.
model = DenseNet121(input_tensor=input_tensor, weights='imagenet',
                    include_top=False)
As it is clear from the table above, the model we're working with has a lot of layers. Keras has its own names for these layers. Let's make a list of these names so that we can easily refer to individual layers later.
layers = dict([(layer.name, layer.output) for layer in model.layers])
layers
If you stare at the list above, you'll convince yourself that we covered all items we wanted in the table. Notice also that because we provided Keras with a concrete input tensor, the various TensorFlow tensors get well-defined shapes.
The crux of the paper we're trying to reproduce is that the style transfer problem can be posed as an optimisation problem, where the loss function we want to minimise can be decomposed into three distinct parts: the content loss, the style loss and the total variation loss.
- We introduce a function called visualise_img, which will help us visualise images during content and style reconstruction.
- We also introduce an Evaluator class that computes the loss and gradients in one pass while exposing them via two separate functions, loss and grads. This is done because scipy.optimize requires separate functions for loss and gradients, but computing them separately would be inefficient.
- The Evaluator class is defined here so that it can be used for content and style reconstruction later.
import copy

def visualise_img(x, h, w):
    # Work on a copy so the optimiser's array is left untouched
    x_p = copy.deepcopy(x)
    x_p = x_p.reshape((h, w, 3))
    x_p = np.clip(x_p, 0, 255).astype('uint8')
    # PNG data is binary, so write it to an in-memory bytes buffer
    f = BytesIO()
    Image.fromarray(x_p).save(f, 'png')
    IPython.display.display(IPython.display.Image(data=f.getvalue()))

class Evaluator(object):
    def __init__(self, f_outputs, height, width):
        self.loss_value = None
        self.grad_values = None
        self.f_outputs = f_outputs
        self.height = height
        self.width = width

    def eval_loss_and_grads(self, x):
        x = x.reshape((1, self.height, self.width, 3))
        outs = self.f_outputs([x])
        loss_value = outs[0]
        grad_values = outs[1].flatten().astype('float64')
        return loss_value, grad_values

    def loss(self, x):
        # Compute loss and gradients together and cache the gradients
        assert self.loss_value is None
        loss_value, grad_values = self.eval_loss_and_grads(x)
        self.loss_value = loss_value
        self.grad_values = grad_values
        return self.loss_value

    def grads(self, x):
        # Return the cached gradients and reset the cache
        assert self.loss_value is not None
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values
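The caching idea behind Evaluator can be seen in isolation with a toy loss. In this sketch, CountingOutputs is a hypothetical stand-in for the Keras function (it returns the loss and gradient of a sum of squares), and CachedEvaluator is a stripped-down version of the class above without the image reshape.

```python
import numpy as np

class CountingOutputs(object):
    """Stand-in for the Keras function: returns [loss, gradient] for x."""
    def __init__(self):
        self.calls = 0

    def __call__(self, inputs):
        self.calls += 1
        x = inputs[0]
        loss = np.sum(x ** 2)   # toy loss: sum of squares
        grad = 2.0 * x          # its exact gradient
        return [loss, grad]

class CachedEvaluator(object):
    """Same caching idea as Evaluator, without the image reshape."""
    def __init__(self, f_outputs):
        self.f_outputs = f_outputs
        self.loss_value = None
        self.grad_values = None

    def loss(self, x):
        outs = self.f_outputs([x])
        self.loss_value = outs[0]
        self.grad_values = outs[1].flatten().astype('float64')
        return self.loss_value

    def grads(self, x):
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

f_outputs = CountingOutputs()
evaluator = CachedEvaluator(f_outputs)
x = np.array([1.0, -2.0, 3.0])

loss_value = evaluator.loss(x)    # runs the underlying function once
grad_values = evaluator.grads(x)  # reuses the cached gradient

print(loss_value, grad_values, f_outputs.calls)
```

The pattern relies on the optimiser asking for the loss before the gradients at the same point, which is what the assertions in Evaluator guard against violating.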
The content loss
The content loss is the (scaled, squared) Euclidean distance between feature representations of the content and combination images.
def content_loss(content, combination):
    return backend.sum(backend.square(combination - content)) / 2
Selecting layer for content representation
For the content loss, Gatys et al. (2015) choose the layer based on how well its feature maps reconstruct an image that preserves the high-level content of the original image but loses the exact pixel information.
For example:

- It can be seen that relu5_1 in the above figure is a perfect choice.
- So we have to find a layer that gives the same result as the above choice.
- We define the function content_image_reconstruction below, which reconstructs an image from a specific layer.
- In content reconstruction, we take the content image and the generated (combination) image and try to minimise the content loss. It runs for 20 iterations, and the gradients update the combination image. After 20 iterations, we get the reconstructed image.
- We stop after 20 iterations because the loss stops reducing significantly.
def content_image_reconstruction(layer_name):
    layer_features = layers[layer_name]
    content_image_features = layer_features[0, :, :, :]
    combination_features = layer_features[2, :, :, :]
    loss = content_loss(content_image_features,
                        combination_features)
    grads = backend.gradients(loss, combination_image)
    outputs = [loss]
    outputs += grads
    f_outputs = backend.function([combination_image], outputs)
    evaluator = Evaluator(f_outputs, height, width)
    # white noise image
    x = np.random.uniform(0, 255, (1, height, width, 3))
    iterations = 20
    # print layer-name & number of iterations
    title = layer_name + " (%d-iters):" % (iterations)
    print(title)
    print('='*len(title))
    for i in range(iterations):
        start_time = time.time()
        x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                         fprime=evaluator.grads, maxfun=20)
        end_time = time.time()
        # print loss value & time taken in iteration i
        # format: loss(time-in-seconds)
        print("%.2f(%ds)," % (min_val, end_time - start_time), end='')
        # print new line so that we don't exceed way out of screen
        if (i % 10 == 9):
            print()
    print()
    visualise_img(x, height, width)
- Using the above function, we can go through each layer and find a suitable one for the content representation.
- But since there are so many layers to go through, we first select one layer from each convX-block1-bn, where X ranges from 1 to 5.
- The reconstruction function prints the loss and time taken in each iteration in the format loss(time-in-secs), ... along with the reconstructed image.
layer_names = ['conv1/bn', 'conv2_block1_0_bn', 'conv3_block1_0_bn',
               'conv4_block1_0_bn', 'conv5_block1_0_bn']
for cnn_layer in layer_names:
    content_image_reconstruction(cnn_layer)
- We can see above that conv1 and conv2 give better images, and that the other layers lose too much information.
- It can also be seen that conv1 does not lose pixel information, so we drop that layer as well and go through the other layers with the prefix conv1.
- Hence, we now go through the initial block of all the layers starting with conv2.
layer_names = ['conv1/conv', 'conv1/relu', 'conv2_block1_0_bn',
               'conv2_block2_0_bn', 'conv2_block3_0_bn',
               'conv2_block4_0_bn', 'conv2_block5_0_bn', 'conv2_block6_0_bn']
for cnn_layer in layer_names:
    content_image_reconstruction(cnn_layer)
- It can be seen above that blocks 5 and 6 also lose a lot of information. Hence we now go through all the layers with the prefixes conv1/relu, conv2_block1, conv2_block2, conv2_block3 and conv2_block4.
- Due to memory constraints, the code below is commented out, and we only show the images for layers that are candidates for the content representation.
layer_names = ['conv2_block1_0_bn', 'conv2_block1_0_relu', 'conv2_block1_1_bn',
'conv2_block1_1_conv', 'conv2_block1_1_relu',
'conv2_block1_2_conv', 'conv2_block1_concat',
'conv2_block2_0_bn', 'conv2_block2_0_relu', 'conv2_block2_1_bn',
'conv2_block2_1_conv', 'conv2_block2_1_relu',
'conv2_block2_2_conv', 'conv2_block2_concat',
'conv2_block3_0_bn', 'conv2_block3_0_relu', 'conv2_block3_1_bn',
'conv2_block3_1_conv', 'conv2_block3_1_relu',
'conv2_block3_2_conv', 'conv2_block3_concat',
'conv2_block4_0_bn', 'conv2_block4_0_relu',
'conv2_block4_1_bn', 'conv2_block4_1_conv',
'conv2_block4_1_relu', 'conv2_block4_2_conv',
'conv2_block4_concat']
# for cnn_layer in layer_names:
#     content_image_reconstruction(cnn_layer)
candidate_layer_names = ['conv1/relu', 'conv2_block1_0_bn',
                         'conv2_block3_0_relu', 'conv2_block4_0_relu']
for cnn_layer in candidate_layer_names:
    content_image_reconstruction(cnn_layer)
- It can be inferred from the above that conv2_block1_0_bn, conv2_block3_0_relu and conv2_block4_0_relu can be good choices.
The style loss
This is where things start to get a bit intricate.
For the style loss, we first define something called a Gram matrix. The terms of this matrix are proportional to the covariances of corresponding sets of features, and thus capture information about which features tend to activate together. By only capturing these aggregate statistics across the image, they are blind to the specific arrangement of objects inside the image. This is what allows them to capture information about style independent of content. (This is not trivial at all, and I refer you to a paper that attempts to explain the idea.)
The Gram matrix can be computed efficiently by reshaping the feature spaces suitably and multiplying the resulting matrix by its own transpose.
def gram_matrix(x):
    features = backend.batch_flatten(backend.permute_dimensions(x, (2, 0, 1)))
    gram = backend.dot(features, backend.transpose(features))
    return gram
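The claim that the Gram matrix is blind to spatial arrangement can be checked numerically. The sketch below uses NumPy rather than the Keras backend, with a hypothetical helper gram_hwc that mirrors gram_matrix for an array of shape (height, width, channels) and made-up random features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps: height 4, width 4, 3 channels.
features = rng.normal(size=(4, 4, 3))

def gram_hwc(x):
    """Channels-by-channels Gram matrix of an (h, w, c) array."""
    flat = x.transpose(2, 0, 1).reshape(x.shape[2], -1)  # shape (c, h*w)
    return flat @ flat.T

# Shuffle the spatial positions, keeping each position's channel vector intact.
positions = features.reshape(-1, 3)
shuffled = positions[rng.permutation(len(positions))].reshape(4, 4, 3)

print(gram_hwc(features).shape)                              # (3, 3)
print(np.allclose(gram_hwc(features), gram_hwc(shuffled)))   # True
```

Scrambling where things are leaves the Gram matrix unchanged, so a loss built on it can only constrain which features co-occur, not their layout.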
The style loss is then the (scaled, squared) Frobenius norm of the difference between the Gram matrices of the style and combination images.
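def style_loss(style, combination, height, width):
    S = gram_matrix(style)
    C = gram_matrix(combination)
    # Note: the scale uses the input image's channel count and size as
    # constants, not the per-layer feature-map dimensions
    channels = 3
    size = height * width
    return backend.sum(backend.square(S - C)) / (4. * (channels ** 2) * (size ** 2))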
Selecting layers for style representation
For the style loss, Gatys et al. (2015) select layers whose reconstructions match the style of a given image on an increasing scale while discarding information about the global arrangement of the scene.
For example:

- The above figure shows what the textures from the selected layers should look like, and that merging those textures gives a much better generalised texture.
- So we have to find one or more layers that give a similar result.
- We define the function style_image_reconstruction below, which reconstructs an image from the given layers.
- In style reconstruction, we take the style image and the generated (combination) image and try to minimise the style loss. It runs for 20 iterations, and the gradients update the combination image. After 20 iterations, we get the reconstructed image.
- As with content reconstruction, we stop after 20 iterations because the loss stops reducing significantly.
def style_image_reconstruction(layer_names):
    loss = backend.variable(0.)
    for layer_name in layer_names:
        layer_features = layers[layer_name]
        style_features = layer_features[1, :, :, :]
        combination_features = layer_features[2, :, :, :]
        loss += style_loss(style_features, combination_features, height, width)
    grads = backend.gradients(loss, combination_image)
    outputs = [loss]
    outputs += grads
    f_outputs = backend.function([combination_image], outputs)
    evaluator = Evaluator(f_outputs, height, width)
    # white noise image
    x = np.random.uniform(0, 255, (1, height, width, 3))
    iterations = 20
    # print layer-name & number of iterations
    title = ",".join(layer_names) + " (%d-iters):" % (iterations)
    print(title)
    print('='*len(title))
    for i in range(iterations):
        start_time = time.time()
        x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                         fprime=evaluator.grads, maxfun=20)
        end_time = time.time()
        # print loss value & time taken in iteration i
        # format: loss(time-in-seconds)
        print("%.2f(%ds)," % (min_val, end_time - start_time), end='')
        # print new line so that we don't exceed way out of screen
        if (i % 10 == 9):
            print()
    print()
    visualise_img(x, height, width)
- Using the above function, we can go through each layer and find suitable ones for the style representation.
- But since there are so many layers to go through, we first select one layer from each convX-block1-bn, where X ranges from 1 to 5 (similar to what we did for content image reconstruction).
- The reconstruction function prints the loss and time taken in each iteration in the format loss(time-in-secs), ... along with the reconstructed image.
layer_names = [('conv1/bn',), ('conv2_block1_0_bn',), ('conv3_block1_0_bn',),
               ('conv4_block1_0_bn',), ('conv5_block1_0_bn',)]
for cnn_layer in layer_names:
    style_image_reconstruction(cnn_layer)
- We can see above that conv2 gives comparatively better texture, and that the other layers lose too much information.
- The concat layers usually carry better texture information from the previous layers, so we go through all the blocks of the layers starting with conv2.
- Due to memory constraints, the code below is commented out, and we only show the images for layers that are candidates for the style representation.
layer_names = [('conv2_block1_0_bn',), ('conv2_block1_0_relu',),
('conv2_block1_1_bn',), ('conv2_block1_1_conv',),
('conv2_block1_1_relu',), ('conv2_block1_2_conv',),
('conv2_block1_concat',), ('conv2_block2_0_bn',),
('conv2_block2_0_relu',), ('conv2_block2_1_bn',),
('conv2_block2_1_conv',), ('conv2_block2_1_relu',),
('conv2_block2_2_conv',), ('conv2_block2_concat',),
('conv2_block3_0_bn',), ('conv2_block3_0_relu',),
('conv2_block3_1_bn',), ('conv2_block3_1_conv',),
('conv2_block3_1_relu',), ('conv2_block3_2_conv',),
('conv2_block3_concat',), ('conv2_block4_0_bn',),
('conv2_block4_0_relu',), ('conv2_block4_1_bn',),
('conv2_block4_1_conv',), ('conv2_block4_1_relu',),
('conv2_block4_2_conv',), ('conv2_block4_concat',),
('conv2_block5_0_bn',), ('conv2_block5_0_relu',),
('conv2_block5_1_bn',), ('conv2_block5_1_conv',),
('conv2_block5_1_relu',), ('conv2_block5_2_conv',),
('conv2_block5_concat',), ('conv2_block6_0_bn',),
('conv2_block6_0_relu',), ('conv2_block6_1_bn',),
('conv2_block6_1_conv',), ('conv2_block6_1_relu',),
('conv2_block6_2_conv',), ('conv2_block6_concat',)]
# for cnn_layer in layer_names:
#     style_image_reconstruction(cnn_layer)
layer_names = [('conv2_block1_concat',), ('conv2_block2_concat',),
               ('conv2_block3_concat',), ('conv2_block4_concat',),
               ('conv2_block5_concat',), ('conv2_block6_concat',)]
for cnn_layer in layer_names:
    style_image_reconstruction(cnn_layer)
- Now we merge all the layers selected from above
layer_names = [('conv2_block1_concat', 'conv2_block2_concat',
                'conv2_block3_concat', 'conv2_block4_concat',
                'conv2_block5_concat', 'conv2_block6_concat',)]
for cnn_layer in layer_names:
    style_image_reconstruction(cnn_layer)
Note: As can be seen in the reconstructed style image above, the texture is not yet captured cleanly, and hence the output combination image is not as good as the results in the original paper.
The total variation loss
Now we're back on simpler ground.
If you were to solve the optimisation problem with only the two loss terms we've introduced so far (style and content), you'd find that the output is quite noisy. We thus add another term, called the total variation loss, a regularisation term that encourages spatial smoothness.
You can experiment with reducing the total_variation_weight to see how it changes the noise level of the generated image.
def total_variation_loss(x, height, width):
    a = backend.square(x[:, :height-1, :width-1, :] - x[:, 1:, :width-1, :])
    b = backend.square(x[:, :height-1, :width-1, :] - x[:, :height-1, 1:, :])
    return backend.sum(backend.pow(a + b, 1.25))
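To see what this term penalises, here is a NumPy translation of the same expression applied to a constant image and a noisy one. The function name total_variation_np, the image size and the noise level are assumptions made for this sketch.

```python
import numpy as np

def total_variation_np(x, power=1.25):
    """NumPy version of the loss above for a batch of shape (1, h, w, c)."""
    h, w = x.shape[1], x.shape[2]
    a = np.square(x[:, :h-1, :w-1, :] - x[:, 1:, :w-1, :])   # vertical neighbours
    b = np.square(x[:, :h-1, :w-1, :] - x[:, :h-1, 1:, :])   # horizontal neighbours
    return np.sum(np.power(a + b, power))

rng = np.random.default_rng(0)
flat = np.full((1, 8, 8, 3), 100.0)                      # constant image
noisy = flat + rng.normal(scale=20.0, size=flat.shape)   # same image plus noise

print(total_variation_np(flat))                              # 0.0
print(total_variation_np(noisy) > total_variation_np(flat))  # True
```

Only differences between neighbouring pixels contribute, so the term pushes the optimiser towards smoother images without caring about their overall brightness.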
Final loss calculation
As seen earlier, style transfer is posed as an optimization problem with the following loss function: $$L_{total}(c,s,g) = \alpha L_{content}(c,g) + \beta L_{style}(s,g)$$
We also introduce the total variation loss with weight $\gamma$, so the overall loss function is: $$L_{total}(c,s,g) = \alpha L_{content}(c,g) + \beta L_{style}(s,g) + \gamma L_{tv}(g)$$
def style_transfer_loss(layers, content_weight, style_weight, tv_weight,
                        content_layer_name, style_layer_names, height, width):
    layer_features = layers[content_layer_name]
    content_image_features = layer_features[0, :, :, :]
    combination_features = layer_features[2, :, :, :]
    # alpha * L_content
    final_content_loss = content_weight * content_loss(content_image_features,
                                                       combination_features)
    final_style_loss = backend.variable(0.)
    feature_layers = style_layer_names
    for layer_name in feature_layers:
        layer_features = layers[layer_name]
        style_features = layer_features[1, :, :, :]
        combination_features = layer_features[2, :, :, :]
        sl = style_loss(style_features, combination_features, height, width)
        # beta * L_style
        final_style_loss += (style_weight / len(feature_layers)) * sl
    # gamma * L_tv
    final_tv_loss = tv_weight * total_variation_loss(combination_image, height,
                                                     width)
    # L_total
    final_loss = final_content_loss + final_style_loss + final_tv_loss
    return final_loss
We'll now use the feature spaces provided by specific layers of our model to define these three loss functions:
- Content loss = final_content_loss
- Style loss = final_style_loss
- Total variation loss = final_tv_loss
The relative importance of the loss terms is determined by a set of scalar weights. These are arbitrary, but the following set has been chosen after quite a bit of experimentation to find values that generate output that's aesthetically pleasing to me.
content_weight = 4
style_weight = 500000
tv_weight = 0.0001
content_layer_name = 'conv2_block3_0_relu'
style_layer_names = ['conv2_block1_concat', 'conv2_block2_concat',
'conv2_block3_concat', 'conv2_block4_concat',
'conv2_block5_concat', 'conv2_block6_concat']
final_loss = style_transfer_loss(layers, content_weight, style_weight,
tv_weight, content_layer_name,
style_layer_names, height, width)
Define needed gradients and solve the optimisation problem
The goal of this journey was to set up an optimisation problem that solves for a combination image containing the content of the content image while having the style of the style image. Now that we have our input images massaged and our loss function calculators in place, all we have left to do is define the gradients of the total loss with respect to the combination image, and use these gradients to iteratively improve our combination image to minimise the loss.
We start by defining the gradients.
grads = backend.gradients(final_loss, combination_image)
outputs = [final_loss]
outputs += grads
f_outputs = backend.function([combination_image], outputs)
Now we're finally ready to solve our optimisation problem. This combination image begins its life as a random collection of (valid) pixels, and we use the L-BFGS algorithm (a quasi-Newton algorithm that's significantly quicker to converge than standard gradient descent) to iteratively improve upon it.
We stop after 80 iterations because the output looks good to me and the loss stops reducing significantly.
# Subtract 128 so the initial noise is centred on zero; the optimisation converges faster
x = np.random.uniform(0, 255, (1, height, width, 3)) - 128.
iterations = 80
evaluator = Evaluator(f_outputs, height, width)
# print number of iterations
title = "%d-iters [format: loss(time-in-seconds)]:" % (iterations)
print(title)
print('='*len(title))
for i in range(iterations):
    start_time = time.time()
    x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                     fprime=evaluator.grads, maxfun=20)
    end_time = time.time()
    # print loss value & time taken in iteration i
    # format: loss(time-in-seconds)
    print("%.2f(%ds)," % (min_val, end_time - start_time), end='')
    # print new line so that we don't exceed way out of screen
    if ((i+1) % 5 == 0):
        print()
print("Content Image")
visualise_img(content_array, height, width)
print("Style Image")
visualise_img(style_array, height, width)
print("Combination Image")
visualise_img(x, height, width)
Overall code
- The following is the overall code, assembled from the steps above.
- Due to memory constraints, we visualised only 244x244 images above. The code below can be used to estimate the time taken by this algorithm and to play with images of different sizes for style transfer.
content_img_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Taj_Mahal_in_March_2004.jpg"
style_img_url = "http://1.bp.blogspot.com/-yqh8zFJh_OU/UcDJKe-EYaI/AAAAAAAAS2I/Qb3bm2zwQu0/s1600/%5BIMG+SRC=%22URL%22+ALT=picasso+painting,+romancing+picasso+painting,+colorful+oil+painting,k+madison+moore+artist%5D++copy.jpg"
height = 512
width = 512
content_weight = 4
style_weight = 500000
total_variation_weight = 0.0001
content_layer_name = 'conv2_block3_0_relu'
style_layer_names = ['conv2_block1_concat', 'conv2_block2_concat',
'conv2_block3_concat', 'conv2_block4_concat',
'conv2_block5_concat', 'conv2_block6_concat']
response = requests.get(content_img_url)
content_image = Image.open(BytesIO(response.content))
# PIL's resize expects (width, height)
content_image = content_image.resize((width, height))
response = requests.get(style_img_url)
style_image = Image.open(BytesIO(response.content))
style_image = style_image.resize((width, height))
content_array = np.asarray(content_image, dtype='float32')
content_array = np.expand_dims(content_array, axis=0)
style_array = np.asarray(style_image, dtype='float32')
style_array = np.expand_dims(style_array, axis=0)
content_image = backend.variable(content_array)
style_image = backend.variable(style_array)
combination_image = backend.placeholder((1, height, width, 3))
input_tensor = backend.concatenate([content_image,
style_image,
combination_image], axis=0)
model = DenseNet121(input_tensor=input_tensor, weights='imagenet',
include_top=False)
layers = dict([(layer.name, layer.output) for layer in model.layers])
final_loss = style_transfer_loss(layers, content_weight, style_weight,
total_variation_weight, content_layer_name,
style_layer_names, height, width)
grads = backend.gradients(final_loss, combination_image)
outputs = [final_loss]
outputs += grads
f_outputs = backend.function([combination_image], outputs)
# Start from uniform noise and subtract 128 so the values are roughly
# zero-centred; the optimisation tends to converge faster from this start
x = np.random.uniform(0, 255, (1, height, width, 3)) - 128.
iterations = 80
evaluator = Evaluator(f_outputs, height, width)
# print number of iterations
title = "%d-iters [format: loss(time-in-seconds)]:" % (iterations)
print(title)
print('='*len(title))
for i in range(iterations):
    start_time = time.time()
    x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                     fprime=evaluator.grads, maxfun=20)
    end_time = time.time()
    # print loss value & time taken in iteration i
    # format: loss(time-in-seconds)
    print("%.2f(%ds)," % (min_val, end_time - start_time), end='')
    # print a new line every 5 iterations so the output stays on screen
    if (i + 1) % 5 == 0:
        print()
print("Content Image")
visualise_img(content_array, height, width)
print("Style Image")
visualise_img(style_array, height, width)
print("Combination Image")
visualise_img(x, height, width)
Conclusion and further improvements¶
- The output above can be improved, as the results are not yet comparable with those shown in the paper.
- We can try different combinations of layers for the style representation to get better output. Assigning different weights to the chosen layer outputs may also yield better results.
- We can also try deeper DenseNet models (for example DenseNet169 or DenseNet201).
- As beautiful as the output of this code can be, the process we use to generate it is quite slow. No matter how much you speed this algorithm up (with GPUs and creative hacks), it is still going to be a relatively expensive problem to solve, because we solve an entire optimisation problem with multiple forward and backward passes each time we want to generate an image.
- Try the faster version of this algorithm (Johnson et al., 2016), where the optimisation problem is replaced with an image transformation CNN, which in turn uses the VGG16 network as before to measure losses. When this transformation network is trained on many images for a fixed style image, we end up with a fully feed-forward CNN that we can apply for style transfer. The paper reports a speed-up of roughly three orders of magnitude over the optimisation-based approach.
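As a rough illustration of the per-layer weighting suggested in the second point above, the sketch below computes a Gram-matrix style loss for each layer and combines the layers with explicit weights. It is written in plain NumPy rather than Keras; the function names, the random feature maps and the example weights are hypothetical, the normalisation follows Gatys et al. (2015), and it is not the `style_transfer_loss` used in this notebook.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (height, width, channels) feature map."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)   # one row per spatial position
    return flat.T @ flat                # (channels, channels) correlations


def layer_style_loss(style_features, generated_features):
    """Squared Gram-matrix difference for one layer, scaled as in Gatys et al."""
    h, w, c = style_features.shape
    diff = gram_matrix(style_features) - gram_matrix(generated_features)
    return np.sum(diff ** 2) / (4.0 * (c ** 2) * ((h * w) ** 2))


def weighted_style_loss(style_maps, generated_maps, layer_weights):
    """Weighted sum of the per-layer style losses."""
    return sum(weight * layer_style_loss(s, g)
               for s, g, weight in zip(style_maps, generated_maps, layer_weights))


# Hypothetical feature maps standing in for two DenseNet layer outputs
rng = np.random.default_rng(0)
style_maps = [rng.normal(size=(8, 8, 4)), rng.normal(size=(4, 4, 6))]
generated_maps = [rng.normal(size=(8, 8, 4)), rng.normal(size=(4, 4, 6))]
layer_weights = [0.7, 0.3]  # assumed values, to be tuned by experiment
total = weighted_style_loss(style_maps, generated_maps, layer_weights)
```

Setting a weight to zero removes that layer from the style representation, so the same function can also be used to compare different layer combinations.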
Comparative Studies¶
Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Johnson et al., 2016¶
- The main contribution of Johnson et al. is to show that feeding the generated image forward through a pre-trained image classification model, and computing losses from the outputs of some intermediate layers, can be used to train a network that produces results similar to those of Gatys et al. with significantly less computation.
- The paper shows that if we limit ourselves to a single style image, we can train a neural network to solve this optimisation problem for us in real time and transform any given content image into a styled version.
- The system overview proposed in the paper is shown below:

The system consists of two networks:
- Image transformation network - A multi-layer convolutional neural network that transforms an input image into an output image combining the content of the content image with the style of the style image. It consists of 3 convolutional layers with ReLU non-linearities, 5 residual blocks, 2 transpose (fractionally strided) convolutional layers, and a final convolutional layer followed by a tanh non-linearity that produces the output image. The network downsamples with strided convolutions, which significantly increases the receptive field, and then upsamples with fractionally strided convolutions so that the output image has the same size as the input image.
- Loss network - Used to calculate the loss between the input image and the generated image. The loss is calculated in the same way as in Gatys et al. (2015). In this paper, the representations are computed using the VGG network, which has been pre-trained for object recognition. Other networks such as DenseNet can be used as well, as we did with the approach of Gatys et al. (2015).
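To make the downsample-then-upsample idea concrete, the following sketch traces the spatial size of a feature map through a transformation network. The kernel sizes, strides and padding values are assumptions chosen for illustration (two stride-2 convolutions mirrored by two stride-2 transpose convolutions), and `trace_sizes` is a hypothetical helper, not code from the paper.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding, output_padding):
    """Output size of a transpose (fractionally strided) convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_sizes(size):
    """Spatial size after each stage of an assumed transformation network."""
    sizes = [size]
    # 9x9 convolution, stride 1: keeps the size
    sizes.append(conv_out(sizes[-1], kernel=9, stride=1, padding=4))
    # two 3x3 convolutions with stride 2: each halves the size
    sizes.append(conv_out(sizes[-1], kernel=3, stride=2, padding=1))
    sizes.append(conv_out(sizes[-1], kernel=3, stride=2, padding=1))
    # residual blocks are assumed to preserve the size, so they are skipped here
    # two 3x3 transpose convolutions with stride 2: each doubles the size
    for _ in range(2):
        sizes.append(conv_transpose_out(sizes[-1], kernel=3, stride=2,
                                        padding=1, output_padding=1))
    # final 9x9 convolution, stride 1: keeps the size
    sizes.append(conv_out(sizes[-1], kernel=9, stride=1, padding=4))
    return sizes


print(trace_sizes(256))  # [256, 256, 128, 64, 128, 256, 256]
```

Under these assumptions the residual blocks operate at a quarter of the input resolution, which is where most of the computational saving of the strided design comes from.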
Given this system, we can train the image transformation network to reduce the total style transfer loss. To train the network, one picks a fixed style image and uses a large batch of different content images as training examples. In their paper, Johnson et al. trained their network on the Microsoft COCO dataset, an object recognition dataset of about 80,000 different images.
Training involves using the loss network to evaluate the loss for a given training example and then propagating this error back through every layer of the image transformation network. So the first part of the structure is an “Image Transform Net”, which generates a new image from the input image, and the second part is simply a “Loss Network”, which is only used in the forward direction. The weights of the loss network are fixed and are not updated during training.
Testing involves passing the image through the transformation network; the final layer produces a valid image with the style imposed on the content image. This means the generated output must contain values that all lie in the valid pixel range of 0 to 255. To achieve this, the transformation network ends with a tanh layer whose output is scaled to the range 0 to 255.
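One simple way to obtain such a scaled output is sketched below. The affine mapping `127.5 * (tanh(x) + 1)` is an assumption made for illustration; the paper's own implementation may use a different scaling, and `tanh_to_pixels` is a hypothetical name.

```python
import numpy as np


def tanh_to_pixels(pre_activation):
    """Map unbounded activations to the pixel range [0, 255] via a scaled tanh."""
    # tanh lies in [-1, 1]; shifting by 1 and scaling by 127.5 gives [0, 255]
    return 127.5 * (np.tanh(pre_activation) + 1.0)


# Hypothetical final-layer activations for a 1x4x4x3 image batch
activations = np.random.default_rng(0).normal(scale=3.0, size=(1, 4, 4, 3))
pixels = tanh_to_pixels(activations)
```

Whatever the activations are, every value of `pixels` stays inside the valid range, so no clipping is needed before the image is displayed.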
Since training this network requires substantial compute power, we won't implement it here. The paper is described only to show a faster alternative that can be used for real-time style transfer. As seen above, testing needs only a single forward pass, which explains the speed-up over the approach of Gatys et al. (2015).
Universal Style Transfer via Feature Transforms by Li et al., 2017¶
As seen in the Gatys et al. paper, optimization-based methods can handle arbitrary styles with pleasing visual quality, but at a high computational cost. Conversely, as seen in the perceptual-loss paper, feed-forward approaches can be executed efficiently but are limited to a fixed number of styles or suffer from compromised visual quality.
This paper proposes a simple yet effective method for universal style transfer which achieves the style-agnostic generalization ability with marginally compromised visual quality and execution efficiency.
It formulates style transfer as an image reconstruction process coupled with feature transformation, i.e., whitening and coloring. The reconstruction part is responsible for inverting features back to the RGB space and the feature transformation matches the statistics of a content image to a style image.
Hence there are two parts to this approach:
Reconstruction decoder:
- An auto-encoder network is constructed for general image reconstruction. A pretrained model (VGG-19 in the paper) is used as the encoder; its weights are fixed while a decoder network is trained to invert the (VGG) features back to the original image. The decoder is designed to be symmetrical to the encoder network, with nearest-neighbor upsampling layers used to enlarge the feature maps.

- Evaluation involves selecting feature maps at different layers of the network and training a separate decoder for each selected layer. The pixel reconstruction loss and the feature loss are combined to reconstruct an input image: $$L = ||I_o - I_i||^2_2 + \lambda ||\phi(I_o) - \phi (I_i) ||^2_2$$
- where $I_i$ and $I_o$ are the input image and the reconstruction output, $\phi$ is the encoder that extracts the features from the selected layer, and $\lambda$ is the weight that balances the two losses.
- After training, the decoder is fixed (i.e. it is not fine-tuned) and is used as a feature inverter.
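The reconstruction objective above can be sketched numerically as follows. This is only an illustration: `toy_encoder` is a hypothetical stand-in for $\phi$ (simple 2x2 average pooling rather than VGG-19), and the value of `lam` is an arbitrary assumption.

```python
import numpy as np


def toy_encoder(image):
    """Hypothetical stand-in for phi: 2x2 average pooling of an (H, W, C) image."""
    h, w, c = image.shape  # H and W are assumed to be even
    return image.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def reconstruction_loss(original, reconstructed, encoder, lam=1.0):
    """Pixel loss plus lam times the feature loss, both squared L2 norms."""
    pixel_loss = np.sum((reconstructed - original) ** 2)
    feature_loss = np.sum((encoder(reconstructed) - encoder(original)) ** 2)
    return pixel_loss + lam * feature_loss


# Tiny worked example: a 2x2 single-channel image of zeros reconstructed as ones
original = np.zeros((2, 2, 1))
reconstructed = np.ones((2, 2, 1))
loss = reconstruction_loss(original, reconstructed, toy_encoder, lam=0.5)
print(loss)  # 4.5 = pixel loss 4.0 + 0.5 * feature loss 1.0
```

In the actual method the encoder would be a fixed VGG-19 layer and the loss would be minimised with respect to the decoder weights.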
Whitening and coloring transforms (WCT)
- Given a pair of content and style images, vectorized feature maps are first extracted from a certain layer, say $f_c$ and $f_s$ for the content and style images respectively. If $f_c$ were fed directly into the decoder, the decoder would reconstruct the original content image. Instead, WCT is used to adjust $f_c$ with respect to the statistics of $f_s$: the goal is to directly transform $f_c$ so that it matches the covariance matrix of $f_s$.
- Please refer to the paper for more information on WCT.
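A minimal NumPy sketch of the whitening and coloring steps is given below, assuming the features are already vectorized as (channels, positions) matrices. The function names and the random test features are hypothetical, the small eigenvalue floor is an added safeguard, and the `alpha` blend mirrors the stylization-strength control described in the paper; this is not the authors' implementation.

```python
import numpy as np


def centered_eig(features):
    """Mean, centered features, and eigen-decomposition of the channel covariance."""
    mean = features.mean(axis=1, keepdims=True)
    centered = features - mean
    cov = centered @ centered.T / (features.shape[1] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # floor tiny eigenvalues to avoid division by zero when whitening
    return mean, centered, np.clip(eigvals, 1e-8, None), eigvecs


def wct(f_c, f_s, alpha=1.0):
    """Whiten content features, then color them with the style statistics."""
    _, c_centered, c_vals, c_vecs = centered_eig(f_c)
    s_mean, _, s_vals, s_vecs = centered_eig(f_s)
    # whitening: remove the correlations between content channels
    whitened = c_vecs @ np.diag(c_vals ** -0.5) @ c_vecs.T @ c_centered
    # coloring: impose the style covariance, then add back the style mean
    colored = s_vecs @ np.diag(s_vals ** 0.5) @ s_vecs.T @ whitened + s_mean
    # blend with the original content features to control the style strength
    return alpha * colored + (1.0 - alpha) * f_c


# Hypothetical features: 4 channels, different numbers of spatial positions
rng = np.random.default_rng(0)
f_c = rng.normal(size=(4, 500))
f_s = 3.0 * rng.normal(size=(4, 300)) + 1.0
f_cs = wct(f_c, f_s)
```

With `alpha=1.0` the covariance of `f_cs` matches that of `f_s` up to numerical error, which can be checked with `np.cov`; `f_cs` would then be passed to the decoder to produce the stylized image.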
For improved results, the paper also presents a multi-level stylization pipeline, which takes all levels of information of a style into account.
In short, the algorithm unfolds the image generation process by training an auto-encoder for image reconstruction, and then integrates the whitening and coloring transforms into the feed-forward passes to match the statistical distributions and correlations between the intermediate features of content and style.
Given time constraints and limited compute resources, this algorithm is not implemented here. The paper also uses the MS COCO dataset to train the decoder.
- As seen in this paper, the method does not require learning for each individual style, which makes it more flexible than the other methods seen in this notebook.
References¶
- A neural algorithm of artistic style (2015) - Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
- Densely Connected Convolutional Networks - Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger
- Perceptual Losses for Real-Time Style Transfer and Super-Resolution - Justin Johnson, Alexandre Alahi, Li Fei-Fei
- Artistic style transfer implementation with a repurposed VGG-Net-16 - H. Narayanan
- Fast Neural Style - Code - Justin Johnson et al.
- Universal Style Transfer via Feature Transforms - Li et al.
- Tensorflow/Keras implementation of WCT
- iPython Notebook of this post